---
id: tutorial-metrics-deepeval
title: Defining Metrics in Deepeval
sidebar_label: Defining Metrics in Deepeval
---

In this section, we’ll dive into the range of **LLM evaluation metrics** available in Deepeval and select those that are most relevant for evaluating our medical chatbot. Deepeval offers 4 types of metrics:

- **LLM System Metrics**: Evaluate single LLM input-output pairs.
- **Conversational Metrics**: Assess entire LLM interactions.
- **Custom Metrics**: Allow for personalized evaluation using user-defined evaluation criteria.
- **Multimodal Metrics**: Analyze systems handling multiple types of data, such as text and images.

:::info
**LLM-based scorers**, or using **LLMs as judges**, are among the most effective evaluation metrics. To learn more about this use case, consider reading the articles on [Deepeval’s system metrics](https://www.confident-ai.com) and [chain of thought prompting](https://www.confident-ai.com).
:::

For this tutorial, we'll concentrate on LLM system, conversational, and custom metrics, as our use case revolves around a chatbot, which does not involve image processing.

## Setting Up

To get started with Deepeval, first install the library using pip. Open your terminal and run:

```bash
pip install deepeval
```

Next, log in to Confident AI. This lets anyone on your team view, compare, and analyze your evaluation results, and collaborate and iterate on them, without needing a technical background. Use the following command to log in:

```bash
deepeval login
```

## LLM System Metrics

**Deepeval's LLM system metrics** include 5 RAG metrics, as well as [hallucination](/docs/metrics-hallucination), [summarization](/docs/metrics-summarization), and [tool correctness](/docs/metrics-tool-correctness). Since our medical chatbot utilizes a RAG engine and various functional tools, we will employ all [5 RAG metrics](/guides/guides-rag-evaluation) along with Tool Correctness. We’ll also be focusing on Hallucination, since any inaccuracies in medical diagnosis can be very harmful.

Let’s begin by defining our five RAG metrics: [Answer Relevancy](/docs/metrics-answer-relevancy), [Contextual Precision](/docs/metrics-contextual-precision), [Contextual Recall](/docs/metrics-contextual-recall), [Contextual Relevancy](/docs/metrics-contextual-relevancy), and [Faithfulness](/docs/metrics-faithfulness).

```python
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
)

# Settings shared by the metrics defined in this tutorial
model = "gpt-4"
include_reason = True
threshold = 0.7

answer_relevancy_metric = AnswerRelevancyMetric(
    threshold=threshold,
    model=model,
    include_reason=include_reason
)

faithfulness_metric = FaithfulnessMetric(
    threshold=threshold,
    model=model,
    include_reason=include_reason
)

contextual_precision_metric = ContextualPrecisionMetric(
    threshold=threshold,
    model=model,
    include_reason=include_reason
)

contextual_recall_metric = ContextualRecallMetric(
    threshold=threshold,
    model=model,
    include_reason=include_reason
)

contextual_relevancy_metric = ContextualRelevancyMetric(
    threshold=threshold,
    model=model,
    include_reason=include_reason
)
```

Next, we'll define our **Tool Correctness** and **Hallucination** metrics:

```python
from deepeval.metrics import (
    ToolCorrectnessMetric,
    HallucinationMetric,
)

tool_correctness_metric = ToolCorrectnessMetric()

hallucination_metric = HallucinationMetric(
    threshold=threshold,
    model=model,
    include_reason=include_reason
)
```
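Every metric above takes a `threshold`, which acts as the pass/fail cut-off for its score. The sketch below is a plain-Python illustration of that idea, not Deepeval's implementation. The `passes` helper and its `higher_is_better` flag are hypothetical names introduced here. It assumes scores lie between 0 and 1 and that most metrics pass when the score is at or above the threshold. Deepeval's documentation describes the Hallucination threshold as a maximum instead, so a lower score is better there; it is worth confirming this against your installed version.

```python
def passes(score, threshold, higher_is_better=True):
    """Return True when `score` clears `threshold` in the given direction.

    Hypothetical helper for illustration only; Deepeval metrics expose
    their own pass/fail result after a measurement.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score is expected to lie in [0, 1]")
    if higher_is_better:
        return score >= threshold
    return score <= threshold


threshold = 0.7

# A relevancy-style score of 0.82 clears a 0.7 minimum.
print(passes(0.82, threshold))

# Read as a maximum (lower is better), the same 0.82 fails.
print(passes(0.82, threshold, higher_is_better=False))
```

Under that reading, reusing `threshold = 0.7` for Hallucination would let a fairly high hallucination score through, so a stricter (lower) value may suit a medical chatbot better.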

:::note
In the examples above, we simply use `"gpt-4"` as the evaluation model, which requires an `OPENAI_API_KEY`. If you're interested in using non-OpenAI models for evaluation, you'll want to visit this [guide on adding custom models](/guides/guides-using-custom-llms).
:::
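Because the OpenAI-backed metrics need `OPENAI_API_KEY`, it can help to fail early when the variable is missing rather than partway through an evaluation run. The following is a minimal sketch using only the Python standard library; the `missing_env_vars` helper is a hypothetical name introduced here and is not part of Deepeval.

```python
import os


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print("Set these environment variables before running the metrics:",
          ", ".join(missing))
```

If you evaluate with a non-OpenAI model, the list of required variables would differ accordingly.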

## Conversational Metrics

We’ll also use all of Deepeval’s **conversational metrics** to evaluate entire conversation threads between users and our medical chatbot. These metrics include:

- [Role Adherence](/docs/metrics-role-adherence)
- [Knowledge Retention](/docs/metrics-knowledge-retention)
- [Conversation Completeness](/docs/metrics-conversation-completeness)
- [Conversation Relevancy](/docs/metrics-conversation-relevancy)

:::info
You'll need to define `chatbot_role` inside your `ConversationalTestCase` when using the Role Adherence metric. We’ll learn more about this in the following sections.
:::

```python
from deepeval.metrics import (
    RoleAdherenceMetric,
    KnowledgeRetentionMetric,
    ConversationCompletenessMetric,
    ConversationRelevancyMetric,
)

conversational_threshold = 0.7

role_adherence_metric = RoleAdherenceMetric(
    threshold=conversational_threshold,
    model=model,
    include_reason=include_reason
)

knowledge_retention_metric = KnowledgeRetentionMetric(
    threshold=conversational_threshold,
    model=model,
    include_reason=include_reason
)

conversation_completeness_metric = ConversationCompletenessMetric(
    threshold=conversational_threshold,
    model=model,
    include_reason=include_reason
)

conversation_relevancy_metric = ConversationRelevancyMetric(
    threshold=conversational_threshold,
    model=model,
    include_reason=include_reason
)
```

## Custom Metrics

Finally, we’ll define two **custom metrics** for our medical chatbot. Deepeval’s `GEval` metric lets us define a custom metric for any use case by providing either a string that describes the evaluation criteria or a list of strings that describes the evaluation steps.

:::note
You’ll need to **iterate and fine-tune these criteria** to minimize bias during evaluation. You can learn more about how to fine-tune your custom metrics in [this guide](/guides/guides-answer-correctness-metric).
:::

### Diagnosis Specificity

The Diagnosis Specificity Metric evaluates how well the chatbot narrows down the diagnosis based on the symptoms provided. It rewards specificity when symptoms justify a detailed diagnosis, ensuring the chatbot provides meaningful and actionable responses.

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

diagnosis_specificity_metric = GEval(
    name="Diagnosis Specificity",
    criteria="Evaluate how well the diagnosis provided matches the most specific and accurate diagnosis based on the given symptoms.",
    # Test case fields the judge model is allowed to consider
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)
```
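As mentioned earlier, `GEval` can also take a list of evaluation steps in place of a single criteria string. The sketch below shows what such a list might look like for Diagnosis Specificity. The step wording is an illustrative assumption, not taken from Deepeval or from clinical guidance, and `validate_steps` is a hypothetical helper that only checks the list's shape.

```python
# Hypothetical evaluation steps for the Diagnosis Specificity metric.
# The wording is illustrative and should be tuned for your own use case.
diagnosis_specificity_steps = [
    "Identify the symptoms described in the input.",
    "Identify the diagnosis, if any, stated in the actual output.",
    "Check whether the diagnosis is as specific as the symptoms justify.",
    "Penalize vague answers when the symptoms clearly point to a narrower diagnosis.",
]


def validate_steps(steps):
    """Return `steps` unchanged if it is a non-empty list of non-blank strings."""
    if not steps:
        raise ValueError("at least one evaluation step is required")
    for step in steps:
        if not isinstance(step, str) or not step.strip():
            raise ValueError(f"invalid evaluation step: {step!r}")
    return steps


validated_steps = validate_steps(diagnosis_specificity_steps)
```

A list like this could then be passed to `GEval` through its `evaluation_steps` argument instead of `criteria`. Explicit steps tend to make the judge's reasoning easier to inspect, at the cost of more wording to maintain.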

### Overdiagnosis

The Overdiagnosis Metric ensures that the chatbot avoids providing overly specific diagnoses when the symptoms are insufficient. This helps to prevent unnecessary anxiety or inappropriate medical advice.

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

overdiagnosis_metric = GEval(
    name="Overdiagnosis",
    criteria="Evaluate whether the diagnosis is overly specific given the symptoms, avoiding unnecessary precision that could lead to overdiagnosis.",
    # Test case fields the judge model is allowed to consider
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)
```
